Fault Tolerant Master-Worker over a Multi-Cluster Architecture

Authors

  • J. Rodrigues de Souza
  • Eduardo Argollo
  • Angelo Duarte
  • Dolores Rexachs
  • Emilio Luque
Abstract

As clusters grow into collections of clusters, the number of potential points of failure increases, which calls for a fault-tolerance scheme. CoHNOW is organized as a hierarchical master-worker scheme whose clusters may be geographically distributed and interconnected over the Internet. This paper describes a system of Fault-Tolerant protection by Data Replication (FT-DR), based on preserving critical functions through on-line dynamic data replication. The goal of the system model is to detect failures in any functional element of the system and to tolerate them by recovering system consistency, guaranteeing completion of the work in progress (the recovery procedure). The model is designed to tolerate more than one simultaneous failure. Its fault-tolerance activities fall into three distinct phases: startup, normal execution (including failure-detection monitoring), and failure recovery. The system targets general master-worker applications running on CoHNOW and is transparent to both the user and the application. The requirements the master-worker environment must meet to support these capabilities, and the resulting runtime overhead, are under evaluation.
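The three phases named in the abstract (startup, normal execution with failure detection, and failure recovery) can be illustrated with a small single-process sketch. This is not the authors' FT-DR implementation: the `Worker` and `Master` classes, the heartbeat timeout, the in-memory `replica` dictionary, and the round-robin dispatch are all assumptions made for illustration, and the sketch covers worker failures only, not failure of the master itself.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical worker; a stale heartbeat is how the master sees a failure."""
    name: str
    last_heartbeat: int = 0  # logical time of the last heartbeat seen

    def run(self, payload):
        return payload * payload  # stand-in for real work


class Master:
    """Hypothetical master that keeps a replica of every unfinished task."""

    def __init__(self, workers, timeout=2):
        self.workers = {w.name: w for w in workers}
        self.timeout = timeout
        self.replica = {}   # task_id -> payload, kept until the result is stored
        self.assigned = {}  # task_id -> name of the worker holding the task
        self.results = {}

    def is_live(self, worker, now):
        # Failure detection (assumed): a worker is live if its heartbeat is recent.
        return now - worker.last_heartbeat <= self.timeout

    def startup(self, tasks):
        # Phase 1: replicate task data before any work is handed out.
        for task_id, payload in enumerate(tasks):
            self.replica[task_id] = payload

    def dispatch(self, now):
        # Phase 2a: assign unassigned tasks to live workers, round-robin.
        live = [w for w in self.workers.values() if self.is_live(w, now)]
        if not live:
            raise RuntimeError("no live workers left")
        unassigned = [t for t in self.replica if t not in self.assigned]
        for i, task_id in enumerate(unassigned):
            self.assigned[task_id] = live[i % len(live)].name

    def collect(self, now):
        # Phase 2b: store results from live workers and drop their replicas.
        for task_id, name in list(self.assigned.items()):
            worker = self.workers[name]
            if self.is_live(worker, now):
                self.results[task_id] = worker.run(self.replica[task_id])
                del self.assigned[task_id]
                del self.replica[task_id]

    def recover(self, now):
        # Phase 3: release tasks held by failed workers; their data is still
        # in the replica, so the next dispatch can hand them out again.
        failed = {n for n, w in self.workers.items() if not self.is_live(w, now)}
        lost = [t for t, n in self.assigned.items() if n in failed]
        for task_id in lost:
            del self.assigned[task_id]
        return lost


def run_demo():
    workers = [Worker("w0"), Worker("w1"), Worker("w2")]
    master = Master(workers, timeout=2)
    master.startup([1, 2, 3, 4, 5, 6])
    master.dispatch(now=0)
    workers[0].last_heartbeat = 3  # w1 and w2 fail at the same time
    master.collect(now=3)
    lost = master.recover(now=3)
    master.dispatch(now=3)
    master.collect(now=3)
    return master.results, lost
```

In this toy run two of the three workers stop sending heartbeats simultaneously; `run_demo()` reports tasks 1, 2, 4 and 5 as lost, and the surviving worker completes them from the replicated data, so all six results are still produced.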


Similar resources

A New Fault Tolerant Nonlinear Model Predictive Controller Incorporating an UKF-Based Centralized Measurement Fusion Scheme

A new Fault Tolerant Controller (FTC) has been presented in this research by integrating a Fault Detection and Diagnosis (FDD) mechanism in a nonlinear model predictive controller framework. The proposed FDD utilizes a Multi-Sensor Data Fusion (MSDF) methodology to enhance its reliability and estimation accuracy. An augmented state-vector model is developed to incorporate the occurred senso...


Distributed Parallel Processing Based on Master/Worker Model in Heterogeneous Computing Environment

Due to the complexity and varying requirements of the applications to utilize large-scale computing resources, there are several issues such as aggregating heterogeneous computing resources, easy-to-use programming model and fault-tolerant mechanism that need to be addressed. This paper presents a general distributed parallel processing architecture based on master/worker model, and it can aggre...


Parallelization of K-Means Clustering on Multi-Core Processors

Multi-core processors have recently become available on most personal computers. To get the maximum benefit of computational power from the multi-core architecture, we need to redesign existing algorithms and software. In this paper we propose the parallelization of the well-known k-means clustering algorithm. We employ a single program multiple data (SPMD) approach based on a message passing...


Development and Performance Analysis of a Fault Tolerant Algorithm for Cluster of Workstations

A Cluster of Workstations (COW) is a network-based multi-computer system, which is the most prominent distributed memory system aimed to replace supercomputers. A cluster of workstations can be viewed as a single machine in which one job is divided into n subtasks and delegated to n workstations in the COW architecture. To get the job completed, all subtasks assigned to component workstations mus...


Scalable And Fault Tolerant Hierarchical B&B Algorithms For Computational Grids

Solving to optimality large instances of combinatorial optimization problems using Branch and Bound (B&B) algorithms requires a huge amount of computing resources. Nowadays, such power is provided by large scale environments such as computational grids. However, grids induce new challenges: scalability, heterogeneity, and fault tolerance. Most existing grid-based B&Bs are developed using the ...




Publication date: 2005